The most under-used statistical method in corpus linguistics: multi-level (and mixed-effects) models

Author

Stefan Th. Gries

Abstract

Much statistical analysis of psycholinguistic data is now done with so-called mixed-effects regression models. This development was spearheaded by a few highly influential introductory articles that (i) showed how these regression models are superior to the previous gold standard and, perhaps even more importantly, (ii) showed how these models are used in practice. Corpus linguistics can benefit from mixed-effects/multi-level models for the same reason that psycholinguistics can: speaker-specific and lexically specific idiosyncrasies, for example, can be accounted for elegantly. In fact, corpus linguistics needs them even more because (i) corpus-linguistic data are observational and thus usually unbalanced and messy/noisy, and (ii) most widely used corpora come with a hierarchical structure that corpus linguists routinely fail to consider. Unlike nearly all overviews of mixed-effects/multi-level modelling, this paper is written specifically for corpus linguists, to get more of them to start using these techniques. After a short methodological history, I provide a non-technical introduction to mixed-effects models and then discuss in detail one example, particle placement in English, to show how mixed-effects/multi-level modelling results can be obtained and how they are far superior to those of traditional regression modelling.

Similar resources

Parameter Estimation in Spatial Generalized Linear Mixed Models with Skew Gaussian Random Effects using Laplace Approximation

 Spatial generalized linear mixed models are used commonly for modelling non-Gaussian discrete spatial responses. We present an algorithm for parameter estimation of the models using Laplace approximation of likelihood function. In these models, the spatial correlation structure of data is carried out by random effects or latent variables. In most spatial analysis, it is assumed that rando...

Evaluating logistic mixed-effects models of corpus-linguistic data

Evaluating the performance of mixed-effects models on the data they are trained on leads to problems in estimating model goodness. Nonetheless, mixed-effects models are preferable for corpus data, where some items have many more observations than others, because not having random effects in the model can cause fixed-effects coefficients to be overly influenced by frequent items, which are often...

Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighbor phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...

Centralized Supply Chain Network Design: Monopoly, Duopoly, and Oligopoly Competitions under Uncertainty

This paper presents a competitive supply chain network design problem in which one, two, or three supply chains are planning to enter the price-dependent markets simultaneously in uncertain environments and decide to set the prices and shape their networks. The chains produce competitive products either identical or highly substitutable. Fuzzy multi-level mixed integer programming is used to mo...

Do We Need Discipline-Specific Academic Word Lists? Linguistics Academic Word List (LAWL)

This corpus-based study aimed at exploring the most frequently used academic words in linguistics and comparing the wordlist with the distribution of high-frequency words in Coxhead’s Academic Word List (AWL) and West’s General Service List (GSL) to examine their coverage within the linguistics corpus. To this end, a corpus of 700 linguistics research articles (LRAC), consisting of approximately ...

Publication date: 2015
